21 research outputs found

    Towards resource-aware computing for task-based runtimes and parallel architectures

    Get PDF
    Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. The increasing restrictions in terms of power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are equally designed, which is an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability. In this thesis we show that parallel systems benefit from taking into account the consequences of manufacturing variability, in terms of both performance and energy efficiency. In order to evaluate our work we have also implemented our own task-based version of the PARSEC benchmark suite. This allows to test our methodology using state-of-the-art parallelization techniques and real world workloads. We present two approaches to mitigate manufacturing variability, by power redistribution at runtime level and by power- and variability-aware job scheduling at system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. The next approach presented in this theis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensures that power consumption stays under a system wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.Los sistemas modernos de gran escala muestran crecientes demandas de energía, hasta el punto de que se ha convertido en una gran presión para las instalaciones y los presupuestos. Las restricciones crecientes de consumo de energía de los sistemas de alto rendimiento (HPC) y los centros de datos han obligado a los proveedores de hardware a incluir capacidades de limitación de energía en sus procesadores. La limitación de energía abre nuevas oportunidades para que las aplicaciones administren directamente su comportamiento de energía a nivel de usuario. Sin embargo, la restricción en el consumo de energía de sockets individuales de un sistema paralelo resulta en diferentes niveles de rendimiento, por el mismo límite de potencia, incluso cuando están diseñados por igual. Esto es un efecto causado durante el proceso de la fabricación. Los chips modernos sufren de un consumo de energía heterogéneo debido a problemas de fabricación, un problema conocido como variabilidad del proceso o fabricación. Como resultado, los sistemas que no consideran este tipo de variabilidad causada por problemas de fabricación conducen a degradaciones del rendimiento y desperdicio de energía. Para evitar dicho impacto negativo, los usuarios y administradores del sistema deben contrarrestar activamente cualquier variabilidad de fabricación. En esta tesis, demostramos que los sistemas paralelos se benefician de tener en cuenta las consecuencias de la variabilidad de la fabricación, tanto en términos de rendimiento como de eficiencia energética. Para evaluar nuestro trabajo, también hemos implementado nuestra propia versión del paquete de aplicaciones de prueba PARSEC, basada en tareas paralelos. Esto permite probar nuestra metodología utilizando técnicas avanzadas de paralelización con cargas de trabajo del mundo real. Presentamos dos enfoques para mitigar la variabilidad de fabricación, mediante la redistribución de la energía a durante la ejecución de las aplicaciones y mediante la programación de trabajos a nivel de todo el sistema. Se puede utilizar un sistema runtime paralelo para tratar con eficacia este nuevo tipo de heterogeneidad de rendimiento, compensando los efectos desiguales de la limitación de potencia. En el contexto de un nodo NUMA compuesto de varios sockets y núcleos, nuestro sistema puede optimizar los niveles de energía y concurrencia asignados a cada socket para maximizar el rendimiento. Aplicado de manera transparente dentro del sistema runtime paralelo, no requiere ninguna interacción del programador como cambiar el código fuente de la aplicación o reconfigurar manualmente el sistema paralelo. Comparamos nuestro novedoso análisis de runtime con los resultados óptimos, obtenidos de una análisis manual exhaustiva, y demostramos que puede lograr el mismo rendimiento a una fracción del costo. El siguiente enfoque presentado en esta tesis, muestra que es posible predecir el impacto de la variabilidad de fabricación en aplicaciones específicas mediante el uso de modelos de predicción de potencia conscientes de la variabilidad. Basados ​​en estos modelos de predicción de energía, proponemos dos políticas de programación de trabajos que consideran los efectos de la variabilidad de fabricación para cada aplicación y que aseguran que el consumo se mantiene bajo un presupuesto de energía de todo el sistema. Evaluamos nuestras políticas con diferentes presupuestos de energía y escenarios de tráfico, que consisten en aplicaciones paralelas que corren en uno o varios nodos.Postprint (published version

    Towards resource-aware computing for task-based runtimes and parallel architectures

    Get PDF
    Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. The increasing restrictions in terms of power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are equally designed, which is an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability. In this thesis we show that parallel systems benefit from taking into account the consequences of manufacturing variability, in terms of both performance and energy efficiency. In order to evaluate our work we have also implemented our own task-based version of the PARSEC benchmark suite. This allows to test our methodology using state-of-the-art parallelization techniques and real world workloads. We present two approaches to mitigate manufacturing variability, by power redistribution at runtime level and by power- and variability-aware job scheduling at system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. The next approach presented in this theis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensures that power consumption stays under a system wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.Los sistemas modernos de gran escala muestran crecientes demandas de energía, hasta el punto de que se ha convertido en una gran presión para las instalaciones y los presupuestos. Las restricciones crecientes de consumo de energía de los sistemas de alto rendimiento (HPC) y los centros de datos han obligado a los proveedores de hardware a incluir capacidades de limitación de energía en sus procesadores. La limitación de energía abre nuevas oportunidades para que las aplicaciones administren directamente su comportamiento de energía a nivel de usuario. Sin embargo, la restricción en el consumo de energía de sockets individuales de un sistema paralelo resulta en diferentes niveles de rendimiento, por el mismo límite de potencia, incluso cuando están diseñados por igual. Esto es un efecto causado durante el proceso de la fabricación. Los chips modernos sufren de un consumo de energía heterogéneo debido a problemas de fabricación, un problema conocido como variabilidad del proceso o fabricación. Como resultado, los sistemas que no consideran este tipo de variabilidad causada por problemas de fabricación conducen a degradaciones del rendimiento y desperdicio de energía. Para evitar dicho impacto negativo, los usuarios y administradores del sistema deben contrarrestar activamente cualquier variabilidad de fabricación. En esta tesis, demostramos que los sistemas paralelos se benefician de tener en cuenta las consecuencias de la variabilidad de la fabricación, tanto en términos de rendimiento como de eficiencia energética. Para evaluar nuestro trabajo, también hemos implementado nuestra propia versión del paquete de aplicaciones de prueba PARSEC, basada en tareas paralelos. Esto permite probar nuestra metodología utilizando técnicas avanzadas de paralelización con cargas de trabajo del mundo real. Presentamos dos enfoques para mitigar la variabilidad de fabricación, mediante la redistribución de la energía a durante la ejecución de las aplicaciones y mediante la programación de trabajos a nivel de todo el sistema. Se puede utilizar un sistema runtime paralelo para tratar con eficacia este nuevo tipo de heterogeneidad de rendimiento, compensando los efectos desiguales de la limitación de potencia. En el contexto de un nodo NUMA compuesto de varios sockets y núcleos, nuestro sistema puede optimizar los niveles de energía y concurrencia asignados a cada socket para maximizar el rendimiento. Aplicado de manera transparente dentro del sistema runtime paralelo, no requiere ninguna interacción del programador como cambiar el código fuente de la aplicación o reconfigurar manualmente el sistema paralelo. Comparamos nuestro novedoso análisis de runtime con los resultados óptimos, obtenidos de una análisis manual exhaustiva, y demostramos que puede lograr el mismo rendimiento a una fracción del costo. El siguiente enfoque presentado en esta tesis, muestra que es posible predecir el impacto de la variabilidad de fabricación en aplicaciones específicas mediante el uso de modelos de predicción de potencia conscientes de la variabilidad. Basados ​​en estos modelos de predicción de energía, proponemos dos políticas de programación de trabajos que consideran los efectos de la variabilidad de fabricación para cada aplicación y que aseguran que el consumo se mantiene bajo un presupuesto de energía de todo el sistema. Evaluamos nuestras políticas con diferentes presupuestos de energía y escenarios de tráfico, que consisten en aplicaciones paralelas que corren en uno o varios nodos

    Power efficient job scheduling by predicting the impact of processor manufacturing variability

    Get PDF
    Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability. In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that they decrease job turnaround time, compared to contemporary scheduling policies used on production clusters, up to 31% while saving up to 5.5% energy.Postprint (author's final draft

    Runtime-guided mitigation of manufacturing variability in power-constrained multi-socket NUMA nodes

    Get PDF
    This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). This work was also partially performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-689878). Finally, the authors are grateful to the reviewers for their valuable comments, to the RoMoL team, to Xavier Teruel and Kallia Chronaki from the Programming Models group of BSC and the Computation Department of LLNL for their technical support and useful feedback.Peer ReviewedPostprint (published version

    The optimization of typical species inventory of habitat types of a NATURA 2000 site using a phytosociological approach

    Get PDF
    The definition of typical species inventories of the 92/43/EEC Directive habitat types is a valuable information for the optimization of the conservation status assessment. Habitat-specific assessment protocols and predefined local inventories of typical species provide a method for a relatively fast and accurate assessment of the criterion “structures and functions”. Habitat types are often defined and described on the basis of a phytosociological description of vegetation units, mainly at the syntaxonomical level of alliance. Therefore, the definition of typical species inventories can be based on phytosociological approaches. Within this concept we surveyed the vegetation of a NATURA 2000 Special Area of Conservation in northern Greece in order to optimize and downscale the existing region-wide inventories of typical species. In total, we sampled 164 relevés in beech and in thermophilous deciduous broadleaved forests. The relevés were assigned to vegetation units and habitat types using numerical approaches and their differential and constant taxa were defined. We used these taxa to draw up the optimized, site-specific inventories of typical species for seven habitat types of community interest and one habitat type of national interest

    PARSECSs: Evaluating the impact of task parallelism in the PARSEC benchmark suite

    Get PDF
    In this work, we show how parallel applications can be implemented efficiently using task parallelism. We also evaluate the benefits of such parallel paradigm with respect to other approaches. We use the PARSEC benchmark suite as our test bed, which includes applications representative of a wide range of domains from HPC to desktop and server applications. We adopt different parallelization techniques, tailored to the needs of each application, to fully exploit the task-based model. Our evaluation shows that task parallelism achieves better performance than thread-based parallelization models, such as Pthreads. Our experimental results show that we can obtain scalability improvements up to 42% on a 16-core system and code size reductions up to 81%. Such reductions are achieved by removing from the source code application specific schedulers or thread pooling systems and transferring these responsibilities to the runtime system software.This work has been partially supported by the European Research Council under the European Union 7th FP, ERC Grant Agreement number 321253, by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, by the Severo Ochoa Program, awarded by the Spanish Government, under grant SEV-2011-00067 and by the HiPEAC Network of Excellence. M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva post-doctoral fellowship number JCI-2012-15047, and M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). Finally, the authors are grateful to the reviewers for their valuable comments, to the people from the Programming Models Group at BSC for their technical support, to the RoMoL team, and to Xavier Teruel, Roger Ferrer and Paul Caheny for their help in this work.Peer ReviewedPostprint (author's final draft

    Evaluating the impact of OpenMP 4.0 extensions on relevant parallel workloads

    Get PDF
    OpenMP has been for many years the most widely used programming model for shared memory architectures. Periodically, new features are proposed and some of them are finally selected for inclusion in the OpenMP standard. The OmpSs programming model developed at the Barcelona Supercomputing Center (BSC) aims to be an OpenMP forerunner that handles the main OpenMP constructs plus some extra features not included in the OpenMP standard. In this paper we show the usefulness of three OmpSs features not currently handled by OpenMP 4.0 by deploying them over three applications of the PARSEC benchmark suite and showing the performance benefits. This paper also shows performance trade-offs between the OmpSs/OpenMP tasking and loop parallelism constructs and shows how a hybrid implementation that combines both approaches is sometimes the best option.This work has been partially supported by the European Research Council under the European Union's 7th FP, ERC Grant Agreement number 321253, by the Spanish Ministry of Science and Innovation under grant TIN2012-34557 and by the HiPEAC Network of Excellence. It has been also supported by the Severo Ochoa Program awarded by the Spanish Government (grant SEV-2011-00067) M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI- 2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co- fund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP_B 00243).Peer ReviewedPostprint (author's final draft

    Runtime-aware architectures

    Get PDF
    In the last few years, the traditional ways to keep the increase of hardware performance to the rate predicted by the Moore’s Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores face. The runtime system of the parallel programming model has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In the paper, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime’s perspective.This work has been partially supported by the European Research Council under the European Union’s 7th FP, ERC Grant Agreement number 321253, by the Spanish Ministry of Science and Innovation under grant TIN2012-34557 and by the HiPEAC Network of Excellence. M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI- 2012-15047, and M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft
    corecore